Skip to content

fix: worker backlog#873

Merged
johnyeocx merged 1 commit intomainfrom
fix/worker-backlog
Mar 3, 2026
Merged

fix: worker backlog#873
johnyeocx merged 1 commit intomainfrom
fix/worker-backlog

Conversation

@johnyeocx
Copy link
Collaborator

@johnyeocx johnyeocx commented Mar 3, 2026

Summary by cubic

Prevents SQS worker backlog by timing out stuck jobs, running migration jobs without blocking the poller, and recycling/respawning workers more aggressively. Also stabilizes a billing migration test with a fixed invoice count.

  • Bug Fixes
    • Enforce 25s per-message processing timeout (below 30s visibility); migration jobs are exempt.
    • Split migration jobs from the poll loop and run them fire-and-forget; log and send failures to Sentry.
    • Add idle self-kill after 5 zero-message intervals (only after some work done) to trigger cluster respawn.
    • Lower recycle threshold from 500k to 50k messages to reduce memory buildup.
    • Update migrate-paid test to assert a fixed invoice count (2) for stability.

Written for commit 29b208b. Summary will update on new commits.

Greptile Summary

This PR addresses worker backlog issues by introducing idle self-kill, per-message timeouts, and fire-and-forget dispatch for migrations. The recycle threshold is also reduced from 500k to 50k messages.

Key changes:

  • Bug fixes — Reduced MAX_MESSAGES_BEFORE_RECYCLE from 500,000 → 50,000 to recycle workers more aggressively and prevent memory buildup.
  • Improvements — Added IDLE_SELF_KILL_THRESHOLD (5 intervals): workers exit cleanly after 5 minutes of zero messages, allowing cluster respawn.
  • Improvements — Added MESSAGE_TIMEOUT_MS = 25_000 and withTimeout helper: non-migration messages that stall beyond 25s raise an error and release the polling loop rather than blocking indefinitely.
  • Improvements — Migration jobs are now dispatched fire-and-forget, preventing them from blocking the regular message batch since they're already deleted from the queue before processing.
  • Bug fixes — Updated migrate-paid-2 test to hardcode count: 2 instead of dynamically capturing invoiceCountBefore.

Issues identified:

  1. The idle self-kill logic can terminate the worker while long-running fire-and-forget migration jobs are still executing in the background, since metrics are only updated upon job completion.
  2. The test change to hardcoded invoice count weakens the "no new invoice created" invariant, making it inconsistent with sibling tests that use dynamic capture.

Confidence Score: 2/5

  • PR has two issues that reduce safety: a production correctness race condition in the idle self-kill logic with fire-and-forget migrations, and a test quality regression that weakens the migration invariant check.
  • The idle self-kill mechanism can trigger while long-running migration jobs are still executing in the background, potentially killing workers mid-operation. Additionally, the test change to hardcoded invoice counts weakens the critical invariant check and is inconsistent with sibling tests. Both issues should be addressed before merge.
  • server/src/queue/initWorkers.ts (production correctness), server/tests/integration/billing/migrations/migrate-paid.test.ts (test quality regression)

Sequence Diagram

sequenceDiagram
    participant PL as Polling Loop
    participant Q as SQS Queue
    participant H as handleSingleMessage
    participant P as processMessage

    PL->>Q: ReceiveMessageCommand (up to 10 msgs)
    Q-->>PL: messages[]

    loop For each message
        PL->>PL: Parse job.name
        alt job.name === Migration
            PL-->>H: fire-and-forget (no await)
            H->>Q: DeleteMessageCommand (immediately)
            H->>P: processMessage (no timeout)
            Note over H,P: Long-running, already deleted from queue
        else Regular message
            PL->>PL: push to regularMessages[]
        end
    end

    PL->>H: Promise.allSettled(regularMessages)
    H->>P: withTimeout(processMessage, 25s)
    P-->>H: result
    H-->>PL: {id, receiptHandle} | null

    PL->>Q: DeleteMessageBatchCommand(toDelete)

    Note over PL: Every 60s: logStatsAndCheckZeroMessages()
    alt messagesProcessed == 0 for 5 intervals AND totalProcessed > 0
        PL->>PL: process.exit(0) [⚠ RACE: may kill while migrations running]
    end
    alt totalMessagesProcessed >= 50,000
        PL->>PL: process.exit(0) [recycle]
    end
Loading

Last reviewed commit: 29b208b

Context used:

  • Rule from dashboard - CLAUDE.md (source)
  • Context from dashboard - When generating the key changes section of the summary, please tag each change with one or many of t... (source)

@johnyeocx johnyeocx requested a review from ay-rod as a code owner March 3, 2026 17:33
@vercel
Copy link

vercel bot commented Mar 3, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
autumn-vite Ready Ready Preview, Comment Mar 3, 2026 5:34pm

Request Review

@johnyeocx johnyeocx merged commit 43d03ad into main Mar 3, 2026
8 of 11 checks passed
Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 2 files

Confidence score: 4/5

  • This PR is likely safe to merge, with only a small test-quality risk rather than a production behavior issue.
  • In server/tests/integration/billing/migrations/migrate-paid.test.ts, hardcoding invoice count to 2 can miss regressions in “no new invoice created” logic if fixture/setup counts change, making the test brittle.
  • Severity is moderate-low (4/10) and scoped to test robustness, so merge risk is minimal but worth tightening soon by comparing against a captured pre-migration count.
  • Pay close attention to server/tests/integration/billing/migrations/migrate-paid.test.ts - the fixed expected count can hide migration regressions when baseline data changes.
Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/tests/integration/billing/migrations/migrate-paid.test.ts">

<violation number="1" location="server/tests/integration/billing/migrations/migrate-paid.test.ts:204">
P2: Hardcoding the invoice count to 2 no longer verifies “no new invoice created” and makes the test brittle if the initial invoice count changes. Capture the pre-migration invoice count and compare against it instead.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

await expectCustomerInvoiceCorrect({
customer,
count: invoiceCountBefore,
count: 2,
Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Hardcoding the invoice count to 2 no longer verifies “no new invoice created” and makes the test brittle if the initial invoice count changes. Capture the pre-migration invoice count and compare against it instead.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/tests/integration/billing/migrations/migrate-paid.test.ts, line 204:

<comment>Hardcoding the invoice count to 2 no longer verifies “no new invoice created” and makes the test brittle if the initial invoice count changes. Capture the pre-migration invoice count and compare against it instead.</comment>

<file context>
@@ -202,7 +201,7 @@ test.concurrent(`${chalk.yellowBright("migrate-paid-2: allocated seats with pric
 	await expectCustomerInvoiceCorrect({
 		customer,
-		count: invoiceCountBefore,
+		count: 2,
 	});
 
</file context>
Fix with Cubic

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

Comment on lines +88 to +96
if (
consecutiveZeroMessageIntervals >= IDLE_SELF_KILL_THRESHOLD &&
totalMessagesProcessed > 0
) {
console.log(
`${logPrefix()} Idle self-kill: 0 messages for ${consecutiveZeroMessageIntervals} intervals after processing ${totalMessagesProcessed} total. Exiting for cluster respawn.`,
);
process.exit(0);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idle self-kill can trigger while fire-and-forget migration jobs are still executing.

Migration jobs are dispatched with .catch() without being awaited (line 275), and messagesProcessed only increments when handleSingleMessage completes. If a migration job runs for 5+ minutes and no regular messages are processed in the interim, consecutiveZeroMessageIntervals reaches 5, causing process.exit(0) while the migration is still mid-flight.

Consider tracking active migration job count:

let activeMigrationJobs = 0;

// When dispatching a migration job (line 275):
activeMigrationJobs++;
handleSingleMessage({ sqs, message, db })
  .catch((error) => {
    console.error(...);
    Sentry.captureException(error);
  })
  .finally(() => activeMigrationJobs--);

// In idle self-kill check (line 88):
if (
  consecutiveZeroMessageIntervals >= IDLE_SELF_KILL_THRESHOLD &&
  totalMessagesProcessed > 0 &&
  activeMigrationJobs === 0
) {
  // safe to exit
}
Prompt To Fix With AI
This is a comment left during a code review.
Path: server/src/queue/initWorkers.ts
Line: 88-96

Comment:
Idle self-kill can trigger while fire-and-forget migration jobs are still executing.

Migration jobs are dispatched with `.catch()` without being awaited (line 275), and `messagesProcessed` only increments when `handleSingleMessage` completes. If a migration job runs for 5+ minutes and no regular messages are processed in the interim, `consecutiveZeroMessageIntervals` reaches 5, causing `process.exit(0)` while the migration is still mid-flight.

Consider tracking active migration job count:

```ts
let activeMigrationJobs = 0;

// When dispatching a migration job (line 275):
activeMigrationJobs++;
handleSingleMessage({ sqs, message, db })
  .catch((error) => {
    console.error(...);
    Sentry.captureException(error);
  })
  .finally(() => activeMigrationJobs--);

// In idle self-kill check (line 88):
if (
  consecutiveZeroMessageIntervals >= IDLE_SELF_KILL_THRESHOLD &&
  totalMessagesProcessed > 0 &&
  activeMigrationJobs === 0
) {
  // safe to exit
}
```

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines 201 to 205
// CRITICAL: No new invoice created
await expectCustomerInvoiceCorrect({
customer,
count: invoiceCountBefore,
count: 2,
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hardcoded invoice count weakens the migration invariant test.

Previously, this test captured invoiceCountBefore and asserted count: invoiceCountBefore, directly verifying "the migration does not create new invoices." By hardcoding count: 2, the test now only checks "there are exactly 2 invoices at this point" — a different guarantee that couples to test setup state.

This is inconsistent with migrate-paid-1 (line 110), migrate-paid-3 (line 298), and migrate-paid-4 (line 396), which all use the dynamic invoiceCountBefore pattern. If the test setup changes or produces an additional invoice earlier, the assertion will silently pass even if the migration itself created an extra invoice.

Consider reverting to the dynamic pattern to maintain the CRITICAL invariant:

Suggested change
// CRITICAL: No new invoice created
await expectCustomerInvoiceCorrect({
customer,
count: invoiceCountBefore,
count: 2,
});
// Verify initial state
let customer = await autumnV1.customers.get<ApiCustomerV3>(customerId);
const invoiceCountBefore = customer.invoices?.length ?? 0;
// ... later ...
await expectCustomerInvoiceCorrect({
customer,
count: invoiceCountBefore,
});
Prompt To Fix With AI
This is a comment left during a code review.
Path: server/tests/integration/billing/migrations/migrate-paid.test.ts
Line: 201-205

Comment:
Hardcoded invoice count weakens the migration invariant test.

Previously, this test captured `invoiceCountBefore` and asserted `count: invoiceCountBefore`, directly verifying "the migration does not create new invoices." By hardcoding `count: 2`, the test now only checks "there are exactly 2 invoices at this point" — a different guarantee that couples to test setup state.

This is inconsistent with `migrate-paid-1` (line 110), `migrate-paid-3` (line 298), and `migrate-paid-4` (line 396), which all use the dynamic `invoiceCountBefore` pattern. If the test setup changes or produces an additional invoice earlier, the assertion will silently pass even if the migration itself created an extra invoice.

Consider reverting to the dynamic pattern to maintain the CRITICAL invariant:

```suggestion
// Verify initial state
let customer = await autumnV1.customers.get<ApiCustomerV3>(customerId);
const invoiceCountBefore = customer.invoices?.length ?? 0;
// ... later ...
await expectCustomerInvoiceCorrect({
  customer,
  count: invoiceCountBefore,
});
```

How can I resolve this? If you propose a fix, please make it concise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant